Query Planner - Teiid 8.8

For each sub-command in the user command an appropriate kind of sub-planner is used (relational, XML, procedure, etc).

Each planner has three primary phases:

Generate canonical plan
Optimization
Plan to process converter - converts plan data structure into a processing form

Relational Planner

The GenerateCanonical class generates the initial (or “canonical” plan). This plan is based on the typical logical order that a SQL query gets executed. A SQL select query has the following possible clauses (all but SELECT are optional): SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT. These clauses are logically executed in the following order:

FROM (read and join all data from tables)
WHERE (filter rows)
GROUP BY (group rows into collapsed rows)
HAVING (filter grouped rows)
SELECT (evaluate expressions and return only requested columns)
INTO
ORDER BY (sort rows)
LIMIT (limit result set to a certain range of results)
These clauses translate into the following types of planning nodes:

FROM: Source node for each from clause item, Join node (if >1 table)

WHERE: Select node

GROUP BY: Group node

HAVING: Select node

SELECT: Project node and DupRemoval node (for SELECT DISTINCT)

INTO: Project node with a SOURCE Node

ORDER BY: Sort node

LIMIT: Limit node

UNION, EXCEPT, INTERSECT: SetOp Node

There is also a Null Node that can be created as the result of rewrite or planning optimizations. It represents a node that produces no rows

Relational optimization is based upon rule execution that evolves the initial plan into the execution plan. There are a set of pre-defined rules that are dynamically assembled into a rule stack for every query. The rule stack is assembled based on the contents of the user’s query and its transformations. For example, if there are no view layers, then RuleMergeVirtual, which merges view layers together, is not needed and will not be added to the stack. This allows the rule stack to reflect the complexity of the query.

Logically the plan node data structure represents a tree of nodes where the source data comes up from the leaf nodes (typically Access nodes in the final plan), flows up through the tree and produces the user’s results out the top. The nodes in the plan structure can have bidirectional links, dynamic properties, and allow any number of child nodes. Processing plan nodes in contrast typical have fixed properties, and only allow for binary operations - due to algorithmic limitations.

Below are some of the rules included in the planner:

RuleRemoveSorts - removes sort nodes that do not have an effect on the result. This most common when a view has an non-limited ORDER BY.

RulePlaceAccess - insert an Access node above every physical Source node. The source node represents a table typically. An access node represents the point at which everything below the access node gets pushed to the source. Later rules focus on either pushing stuff under the access or pulling the access node up the tree to move more work down to the data sources. This rule is also responsible for placing Federated Optimizations#Access Patterns.

RulePushSelectCriteria - pushes select criteria down through unions, joins, and views into the source below the access node. In most cases movement down the tree is good as this will filter rows earlier in the plan. We currently do not undo the decisions made by PushSelectCriteria. However in situations where criteria cannot be evaluated by the source, this can lead to sub optimal plans.

One of the most important optimization related to pushing criteria, is how the criteria will be pushed trough join. Consider the following plan tree that represents a subtree of the plan for the query "select ... from A inner join b on (A.x = B.x) where A.y = 3"

    SELECT (B.y = 3)
           |
          JOIN - Inner Join on (A.x = B.x
         /     \    
      SRC (A)   SRC (B)

SELECT nodes represent criteria, and SRC stands for SOURCE.

It is always valid for inner join and cross joins to push (single source) criteria that are above the join, below the join. This allows for criteria originating in the user query to eventually be present in source queries below the joins. This result can be represented visually as:

          
    JOIN - Inner Join on (A.x = B.x)
          /    \
         /   SELECT (B.y = 3)
        |        |
      SRC (A)   SRC (B)

The same optimization is valid for criteria specified against the outer side of an outer join. For example:

     SELECT (B.y = 3) 
           |
          JOIN - Right Outer Join on (A.x = B.x)
         /     \    
      SRC (A)   SRC (B)

Becomes

          JOIN - Right Outer Join on (A.x = B.x)
          /    \
         /   SELECT (B.y = 3)
        |        |
      SRC (A)   SRC (B)

However criteria specified against the inner side of an outer join needs special consideration. The above scenario with a left or full outer join is not the same. For example:

      SELECT (B.y = 3)
           |
          JOIN - Left Outer Join on (A.x = B.x)
         /     \    
      SRC (A)   SRC (B)

Can become (available only after 5.0.2):

    JOIN - Inner Join on (A.x = B.x)
          /    \
         /   SELECT (B.y = 3)
        |        |
      SRC (A)   SRC (B)

Since the criterion is not dependent upon the null values that may be populated from the inner side of the join, the criterion is eligible to be pushed below the join – but only if the join type is also changed to an inner join.
On the other hand, criteria that are dependent upon the presence of null values CANNOT be moved. For example:

    SELECT (B.y is null)
           |
          JOIN - Left Outer Join on (A.x = B.x)
         /     \   
      SRC (A)   SRC (B)

This plan tree must have the criteria remain above the join, since the outer join may be introducing null values itself. This will be true regardless of which version of Teiid is used.

RulePushNonJoinCriteria - this rule will push criteria out of an on clause if it is not necessary for the correctness of the join.

RuleRaiseNull - this rule will raise null nodes to their highest possible point. Raising a null node removes the need to consider any part of the old plan that was below the null node.

RuleMergeVirtual - merges view layers together. View layers are connected by nesting canonical plans under source leaf nodes of the parent plan. Each canonical plan is also sometimes referred to as a “query frame”. RuleMergeVirtual attempts to merge child frames into the parent frame. The merge involves renaming any symbols in the lower frame that overlap with symbols in the upper frame. It also involves merging the join information together.

RuleRemoveOptionalJoins - removes optional join nodes form the plan tree as soon as possible so that planning will be more optimal.

RulePlanJoins - this rule attempts to find an optimal ordering of the joins performed in the plan, while ensuring that Federated Optimizations#Access Patterns dependencies are met. This rule has three main steps. First it must determine an ordering of joins that satisfy the access patterns present. Second it will heuristically create joins that can be pushed to the source (if a set of joins are pushed to the source, we will not attempt to create an optimal ordering within that set. More than likely it will be sent to the source in the non-ANSI multi-join syntax and will be optimized by the database). Third it will use costing information to determine the best left-linear ordering of joins performed in the processing engine. This third step will do an exhaustive search for 6 or less join sources and is heuristically driven by join selectivity for 7 or more sources.

RuleCopyCriteria - this rule copies criteria over an equality criteria that is present in the criteria of a join. Since the equality defines an equivalence, this is a valid way to create a new criteria that may limit results on the other side of the join (especially in the case of a multi-source join).

RuleCleanCriteria - this rule cleans up criteria after all the other rules.

RuleMergeCriteria - looks for adjacent criteria nodes and merges them together. It looks for adjacent identical conjuncts and removes duplicates.

RuleRaiseAccess - this rule attempts to raise the Access nodes as far up the plan as possible. This is mostly done by looking at the source’s capabilities and determining whether the operations can be achieved in the source or not.

RuleChooseDependent - this rule looks at each join node and determines whether the join should be made dependent and in which direction. Cardinality, the number of distinct values, and primary key information are used in several formulas to determine whether a dependent join is likely to be worthwhile. The dependent join differs in performance ideally because a fewer number of values will be returned from the dependent side. Also, we must consider the number of values passed from independent to dependent side. If that set is larger than the max number of values in an IN criteria on the dependent side, then we must break the query into a set of queries and combine their results. Executing each query in the connector has some overhead and that is taken into account. Without costing information a lot of common cases where the only criteria specified is on a non-unique (but strongly limiting) field are missed. A join is eligible to be dependent if:

- there is at least one equi-join criterion, i.e. tablea.col = tableb.col
- the join is not a full outer join and the dependent side of the join is on the inner side of the join

The join will be made dependent if one of the following conditions, listed in precedence order, holds:

- There is an unsatisfied access pattern that can be satisfied with the dependent join criteria
- The potential dependent side of the join is marked with an option makedep
- (4.3.2) if costing was enabled, the estimated cost for the dependent join (5.0+ possibly in each direction in the case of inner joins) is computed and compared to not performing the dependent join. If the costs were all determined (which requires all relevant table cardinality, column ndv, and possibly nnv values to be populated) the lowest is chosen.
- If key metadata information indicates that the potential dependent side is not “small” and the other side is “not small” or (5.0.1) the potential dependent side is the inner side of a left outer join.

Dependent join is the key optimization we use to efficiently process multi-source joins.
Instead of reading all of source A and all of source B and joining them on A.x = B.x, we read all of A then build a set of A.x that are passed as a criteria when querying B. In cases where A is small and B is large, this can drastically reduce the data retrieved from B, thus greatly speeding the overall query.

RuleChooseJoinStrategy - Determines the base join strategy. Currently this is a decision as to whether to use a merge join rather than the default strategy, which is a nested loop join. Ideally the choice of a hash join would also be evaluated here. Also costing should be used to determine the strategy cost.

RuleDecomposeJoin - this rule perfomrs a partition-wise join optimization on joins of Federated Optimizations#Partitioned Union. The decision to decompose is based upon detecting that each side of the join is a partitioned union (note that non-ansi joins of more than 2 tables may cause the optimization to not detect the appropriate join). The rule currently only looks for situations where at most 1 partition matches from each side.

RuleCollapseSource - this rule removes all nodes below an Access node and collapses them into an equivalent query that is placed in the Access node.

RuleAssignOutputElements - this rule walks top down through every node and calculates the output columns for each node. Columns that are not needed are dropped at every node. This is done by keeping track of both the columns needed to feed the parent node and also keeping track of columns that are “created” at a certain node.

RuleValidateWhereAll - this rule validates a rarely used model option.

RuleAccessPatternValidation - validates that all access patterns have been satisfied.

RulePushLimit - pushes limit and offset information as far as possible in the plan.

Procedure Planner

The procedure planner is fairly simple. It converts the statements in the procedure into instructions in a program that will be run during processing. This is mostly a 1-to-1 mapping and very little optimization is performed.

XML Planner

The XML Planner creates an XML plan that is relatively close to the end result of the Procedure Planner – a program with instructions. Many of the instructions are even similar (while loop, execute SQL, etc). Additional instructions deal with producing the output result document (adding elements and attributes).

The XML planner does several types of planning (not necessarily in this order):

Document selection - determine which tags of the virtual document should be excluded from the output document. This is done based on a combination of the model (which marks parts of the document excluded) and the query (which may specify a subset of columns to include in the SELECT clause).

Criteria evaluation - breaks apart the user’s criteria, determine which result set the criteria should be applied to, and add that criteria to that result set query.

Result set ordering - the query’s ORDER BY clause is broken up and the ORDER BY is applied to each result set as necessary

Result set planning - ultimately, each result set is planned using the relational planner and taking into account all the impacts from the user’s query

Program generation - a set of instructions to produce the desired output document is produced, taking into account the final result set queries and the excluded parts of the document. Generally, this involves walking through the virtual document in document order, executing queries as necessary and emitting elements and attributes.

XML programs can also be recursive, which involves using the same document fragment for both the initial fragment and a set of repeated fragments (each a new query) until some termination criteria or limit is met.